Fix test suite collection and stale expectations #23
Conversation
|
Hi CeeJay, and CJ's agent(s), greetings! Welcome to gently. To set the larger context: we are developing and testing interactions between augmented LLMs, microscopes, and sample environments. At the moment, our critical priority is shipping to users Ryan and Brie, who will interact with the system for their tasks on the DiSPIM. I will have a detailed discussion about this with them at some point, but from the discussions so far, the picture is as follows.
The system is used for imaging embryos. The illuminating objective uses the galvos to steer a pencil beam into a sheet and then sweep it across the scope. As the sheet traverses the scope, the collecting objective, which is perpendicular to the illuminating one, is moved up and down by its piezo to match the speed of the galvo.
Calibration for this is done by Ryan and Brie. They find the embryos individually in XY by looking at an overview camera at the bottom, which has low magnification so that its field of view is large enough to cover the entire sample area. Then we raise the stage in Z (or bring the head axis, F in the 2015 A. Kumar paper, an older implementation, closer to the stage). Then they calibrate, using a procedure we have captured as calibraiton_tools.py. There is room for improvement there, and it can easily be tested in isolation for given embryo XY coordinates after Brie's or Ryan's F-axis alignment, so it can be tested at scale.
There is also an urgent need to automate F-drive focus finding: while lowering the SPIM head into the sample, focus must be acquired reliably without crashing the head into the glass slide, which is what happens if we drive the focus beyond the embryo. The embryo sticks to the glass slide, which is coated with poly-lysine over a very small area so that the eggs stay in place until hatching instead of floating away.
Once the calibration is set, either by Brie, by Ryan, or (at the moment) by gently, we decide on a timelapse. This is the technical layer of gently.
Refer to the paper (Kesavan, Pontus, 2026, Journal of Microscopy) for the concept of the hierarchy of levels in the microscope thought model, with an illustrated example in cancer migration, with figure. The core idea is that there is a planning layer facing the technical implementer's perspective, which looks at data flow, and that planning thought elements shape how a microscopist works with any entity. Therefore, I want to ask whether I can give the biologist that planning layer as their primary layer of interaction with any microscope, and specifically with our microscope, the DiSPIM, looking at an organism. For these reasons, I want to know how to shape our planning implementation and improve its existing structures. It works quite well for what exists already, and has produced plans that were admired by biologists across fields at Janelia, so it seems fair to think that there is some quality to the existing system. We want to improve it and take it to its further potential, potentially including other modalities such as robotic sample preparation (maybe that's too ambitious), or other modalities that go into the intelligent act of running an experiment. I guess that's it for now. I want to see if you can make a single PR with this new information, and we can iterate from there :) |
|
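Thanks, I took this as a request for a single iteration PR using the new context you provided here. I read the Kesavan/Nordenfelt smart-microscopy framework and mapped the technical/experimental/theoretical/conceptual hierarchy into Gently plan mode in #32. That PR adds a first-class plan_context on plan items, persists it through the context stores/templates, teaches the plan-mode prompt/tools to use it, adds validation warnings for missing hierarchy/F-drive calibration-safety context, and includes a short design doc for iteration. |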
|
Thank you very much for this synthesis. Before we proceed further, two
questions:
1. How long does it take to run a plan synthesis call? How clean is the
UI/UX at runtime in plan mode? This can be tested with a browser MCP,
while running gently without the device layer, and using the browser MCP
interface to feed actions into the chat/interface: making the plan,
giving it an intent, and seeing the *quality* of the plan it produces.
2. How generalisable is plan mode outside of DiSPIM, C. elegans, and
imaging itself?
Best,
Kesavan
…On Sun, 31 May, 2026, 8:25 pm ceej640, ***@***.***> wrote:
*ceej640* left a comment (gently-project/gently#23)
Thanks, I took this as a request for a single iteration PR using the new
context you provided here. I read the Kesavan/Nordenfelt smart-microscopy
framework and mapped the technical/experimental/theoretical/conceptual
hierarchy into Gently plan mode in #32.
That PR adds a first-class plan_context on plan items, persists it
through the context stores/templates, teaches the plan-mode prompt/tools to
use it, adds validation warnings for missing hierarchy/F-drive
calibration-safety context, and includes a short design doc for iteration.
|
|
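Thanks, these are the right questions to ask before taking the planning work further. I would separate what I can infer from the code today from what still needs an actual browser/runtime measurement.
1. Plan synthesis runtime and plan-mode UI/UX
I do not yet have a measured browser-MCP runtime number. Based on the code, plan synthesis is not one clean atomic call; it is a conversational LLM turn plus sequential tool calls such as create_campaign, create_plan_item, link_plan_items, propose_plan, and sometimes literature/strain/lab-history lookups.
My expectation before measuring:
- Simple plan, no literature/database lookups: likely tens of seconds to around 1-2 minutes.
- Literature-grounded, multi-phase plan with strain search/read-paper steps: likely several minutes.
- Running without the device layer should remove microscope/device risk, but it does not remove LLM/API/tool latency.
The UI/UX is structurally present: /plan enters plan mode, chat drives synthesis, and the Plans tab has document/board/graph/timeline views for inspection. I would not yet call the runtime UX clean without doing exactly the browser test you suggest. The likely weak points are flow issues rather than the model objects: whether users understand that planning is happening, whether they get enough progress feedback while waiting, whether the generated plan appears clearly in the Plans tab, and whether refinement feels natural.
A useful browser test should measure:
- time to first useful plan-mode response
- time from intent to visible structured plan
- number of turns needed before propose_plan
- whether plan items include controls, decision points, dependencies, specs, references, and now plan_context
- whether the Plans UI makes the result inspectable without relying on chat history
2. Generalisability outside DiSPIM, C. elegans, and imaging
Architecturally, plan mode is more general than DiSPIM/C. elegans/imaging. The strong general pieces are:
- Campaign / phase hierarchy
- typed PlanItems: imaging, bench, genetics, analysis, decision point
- dependencies, status, snapshots, references
- generic lab-history querying
- organism/hardware injection via get_organism() and get_hardware()
- the new PlanContext, which is intentionally modality-agnostic: technical, experimental, theoretical, conceptual context plus constraints and operator/sample context
But the current implementation is still biased toward the original use case:
- ImagingSpec encodes embryo/timelapse/light-sheet-style assumptions.
- search_strains is C. elegans/WormBase/CGC-specific.
- validation still has DiSPIM-ish hardware limits and C. elegans stage logic.
- prompt examples and quality expectations are strongest for C. elegans embryo imaging.
So my answer is: plan mode is conceptually generalisable, and #32 moves it in that direction, but it is not yet fully generalised as software. To support other microscopes, organisms, or robotic sample prep cleanly, Gently needs modality/organism-specific specs and validators to become plugin/capability-driven rather than hard-coded into plan-mode validation and research tools.
The next useful step is probably empirical rather than another abstract planning PR: run the offline/browser plan-mode benchmark you described, record runtime and UX friction, inspect the generated plan quality, and then use those results to decide whether the next PR should target UI feedback, plan quality, or the generalisation boundary. |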
|
Gotcha. I would love to see you explore plan mode empirically. Let me know
if you run into any issues, and whether you are able to set up gently
locally.
…On Sun, 31 May 2026 at 21:42, ceej640 ***@***.***> wrote:
*ceej640* left a comment (gently-project/gently#23)
Thanks, these are the right questions to ask before taking the planning
work further. I would separate what I can infer from the code today from
what still needs an actual browser/runtime measurement.
1. Plan synthesis runtime and plan-mode UI/UX
I do not yet have a measured browser-MCP runtime number. Based on the
code, plan synthesis is not one clean atomic call; it is a conversational
LLM turn plus sequential tool calls such as create_campaign,
create_plan_item, link_plan_items, propose_plan, and sometimes
literature/strain/lab-history lookups.
My expectation before measuring:
- Simple plan, no literature/database lookups: likely tens of seconds
to around 1-2 minutes.
- Literature-grounded, multi-phase plan with strain search/read-paper
steps: likely several minutes.
- Running without the device layer should remove microscope/device
risk, but it does not remove LLM/API/tool latency.
The UI/UX is structurally present: /plan enters plan mode, chat drives
synthesis, and the Plans tab has document/board/graph/timeline views for
inspection. I would not yet call the runtime UX clean without doing exactly
the browser test you suggest. The likely weak points are flow issues rather
than the model objects: whether users understand that planning is
happening, whether they get enough progress feedback while waiting, whether
the generated plan appears clearly in the Plans tab, and whether refinement
feels natural.
A useful browser test should measure:
- time to first useful plan-mode response
- time from intent to visible structured plan
- number of turns needed before propose_plan
- whether plan items include controls, decision points, dependencies,
specs, references, and now plan_context
- whether the Plans UI makes the result inspectable without relying on
chat history
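The measurements listed above could be captured with a small recorder along these lines. This is only a sketch under stated assumptions: `PlanBenchmarkRun`, `missing_plan_fields`, the event names, and the dictionary keys are all hypothetical and are not part of Gently. The clock is injectable so the recorder can be exercised without a live LLM call.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical keys a reviewer might expect on a generated plan item.
EXPECTED_KEYS = ("controls", "dependencies", "specs", "references", "plan_context")


@dataclass
class PlanBenchmarkRun:
    """Hypothetical record of one plan-mode benchmark run (not Gently code)."""

    clock: Callable[[], float] = time.perf_counter
    started_at: Optional[float] = None
    marks: Dict[str, float] = field(default_factory=dict)
    turns: int = 0

    def start(self) -> None:
        """Call when the user's intent is submitted."""
        self.started_at = self.clock()

    def mark(self, event: str) -> None:
        """Record seconds since start() the first time an event is seen."""
        if self.started_at is None:
            raise RuntimeError("call start() before mark()")
        if event not in self.marks:
            self.marks[event] = self.clock() - self.started_at

    def count_turn(self) -> None:
        """Call once per chat turn that happens before propose_plan."""
        self.turns += 1

    def summary(self) -> dict:
        return {
            "time_to_first_response_s": self.marks.get("first_response"),
            "time_to_visible_plan_s": self.marks.get("plan_visible"),
            "turns_before_propose_plan": self.turns,
        }


def missing_plan_fields(item: dict) -> List[str]:
    """List the expected keys that are absent or empty on a plan item."""
    return [key for key in EXPECTED_KEYS if not item.get(key)]
```

A driver script would call `start()` when the intent is sent, `mark("first_response")` and `mark("plan_visible")` as those events are observed, and run `missing_plan_fields` over each generated item to get a rough completeness check alongside the timings.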
2. Generalisability outside DiSPIM, C. elegans, and imaging
Architecturally, plan mode is more general than DiSPIM/C. elegans/imaging.
The strong general pieces are:
- Campaign / phase hierarchy
- typed PlanItems: imaging, bench, genetics, analysis, decision point
- dependencies, status, snapshots, references
- generic lab-history querying
- organism/hardware injection via get_organism() and get_hardware()
- the new PlanContext, which is intentionally modality-agnostic:
technical, experimental, theoretical, conceptual context plus constraints
and operator/sample context
But the current implementation is still biased toward the original use
case:
- ImagingSpec encodes embryo/timelapse/light-sheet-style assumptions.
- search_strains is C. elegans/WormBase/CGC-specific.
- validation still has DiSPIM-ish hardware limits and C. elegans stage
logic.
- prompt examples and quality expectations are strongest for C.
elegans embryo imaging.
So my answer is: plan mode is conceptually generalisable, and #32 moves it in that
direction, but it is not yet fully generalised as software. To support
other microscopes, organisms, or robotic sample prep cleanly, Gently needs
modality/organism-specific specs and validators to become
plugin/capability-driven rather than hard-coded into plan-mode validation
and research tools.
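One way to read "plugin/capability-driven" is a registry in which each validator declares the capability it applies to, and plan validation runs only the validators matching the active organism and hardware. The sketch below assumes this design; `register_validator`, `validate_plan_item`, the capability names, and the item keys are hypothetical and do not describe Gently's current validation code.

```python
from typing import Callable, Dict, List

# A validator takes a plan item (a plain dict here) and returns warning strings.
Validator = Callable[[dict], List[str]]

# Hypothetical registry: capability name -> validators for that capability.
_VALIDATORS: Dict[str, List[Validator]] = {}


def register_validator(capability: str) -> Callable[[Validator], Validator]:
    """Decorator that files a validator under one capability."""
    def decorator(fn: Validator) -> Validator:
        _VALIDATORS.setdefault(capability, []).append(fn)
        return fn
    return decorator


@register_validator("light_sheet")
def check_interval(item: dict) -> List[str]:
    # Illustrative rule only: imaging items on a light-sheet scope need an interval.
    if item.get("kind") == "imaging" and "interval_s" not in item:
        return ["imaging item has no interval_s"]
    return []


@register_validator("c_elegans")
def check_strain(item: dict) -> List[str]:
    # Illustrative rule only: genetics items for this organism need a strain.
    if item.get("kind") == "genetics" and not item.get("strain"):
        return ["genetics item has no strain"]
    return []


def validate_plan_item(item: dict, capabilities: List[str]) -> List[str]:
    """Run only the validators registered for the active capabilities."""
    warnings: List[str] = []
    for capability in capabilities:
        for validator in _VALIDATORS.get(capability, []):
            warnings.extend(validator(item))
    return warnings
```

With this shape, a different microscope or a robotic sample-prep module would contribute its own validators under its own capability name, and plan-mode validation itself would stay free of DiSPIM or C. elegans rules.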
The next useful step is probably empirical rather than another abstract
planning PR: run the offline/browser plan-mode benchmark you described,
record runtime and UX friction, inspect the generated plan quality, and
then use those results to decide whether the next PR should target UI
feedback, plan quality, or the generalisation boundary.
--
Kesavan
|
|
I tried to run the empirical plan-mode benchmark you suggested. Short version: I got partway through local setup and found a few concrete blockers, so I could not honestly report a completed plan-synthesis quality/runtime benchmark yet. What I did:
Findings:
Measured preflight timings, not plan-synthesis timings:
So: I was able to partially set up Gently locally enough to start the server, render the web UI, connect the chat WebSocket, and enter plan mode. I was not able to complete the benchmark you asked for because the environment lacks a real Anthropic API key, lacks the The useful outcome is that this preflight found a real UI startup blocker, and I pushed the fix to #32. The remaining work to complete your requested benchmark is:
I agree with your newer comment on #32 that this now turns into a design question: plan generation needs to be faster and more intentionally structured, probably by reducing tool-call chatter around campaign/phase/task creation while preserving enough structure to enrich the biologist's discovery workflow rather than turning it into a rigid form. |
|
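Thanks. I did try to set Gently up locally and run the empirical offline plan-mode benchmark. Short version: I got to a useful preflight, but not a true plan-quality benchmark yet.
What I did:
- Started from the PR #32 branch and tried python launch_gently.py --offline --no-browser with disposable storage outside the repo.
- The first launch failed before the UI came up because gently_perception is imported at startup but is not installed in this environment, and I could not find it declared in pyproject.toml or requirements.txt.
- To continue only the UI preflight, I used a temporary local shim for gently_perception; that was not a repo change.
- With that shim, the web server started but / returned HTTP 500 because the routes were using the older positional TemplateResponse call shape. The installed Starlette/FastAPI expects TemplateResponse(request, name, context).
- I fixed that route compatibility issue in #32 and pushed commit aa79f0f. After that, / returned HTTP 200.
- Since browser MCP/node_repl was not exposed in this Codex session and Playwright is not installed here, I drove the same /ws/agent protocol the UI uses as a fallback.
Findings from that preflight:
- WebSocket open: ~0.087s
- initial connected message: ~0.002s after open
- /plan command response: ~0.004s
- /plan successfully switched to plan mode and returned: "Switched to plan mode. I'm now your experimental design collaborator."
- The first synthesis chat turn stopped after ~0.357s with Anthropic 401 invalid x-api-key.
Why I could not complete the exact benchmark you asked for:
- There is no ANTHROPIC_API_KEY in the process/user/machine environment here, so true plan synthesis cannot run.
- The missing gently_perception dependency means local offline startup is not clean unless that package is installed, declared, or made optional for plan-only/offline use.
- Browser MCP control was not available in this session, and Playwright is absent, so I could not complete a visible browser interaction/screenshot benchmark; I used a WebSocket fallback instead.
So the honest status is: I was able to set Gently up only partially. After the route fix in #32, the app can serve the web UI and /plan mode switching works, but I cannot report real plan synthesis latency or plan quality until the API key plus setup/tooling gaps are resolved.
This also reinforces the next design point from #32: before optimizing the plan generation UX, we need the local offline plan-mode path to be reliable, then benchmark the actual campaign -> phase -> task creation flow. |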
|
Great feedback! Will start an issue on this soon.
…On Mon, 1 Jun 2026 at 00:23, ceej640 ***@***.***> wrote:
*ceej640* left a comment (gently-project/gently#23)
Thanks. I did try to set Gently up locally and run the empirical offline
plan-mode benchmark.
Short version: I got to a useful preflight, but not a true plan-quality
benchmark yet.
What I did:
- Started from the PR #32 branch and tried python
launch_gently.py --offline --no-browser with disposable storage
outside the repo.
- The first launch failed before the UI came up because
gently_perception is imported at startup but is not installed in this
environment, and I could not find it declared in pyproject.toml or
requirements.txt.
- To continue only the UI preflight, I used a temporary local shim for
gently_perception; that was not a repo change.
- With that shim, the web server started but / returned HTTP 500
because the routes were using the older positional TemplateResponse
call shape. The installed Starlette/FastAPI expects TemplateResponse(request,
name, context).
- I fixed that route compatibility issue in #32 and pushed commit
aa79f0f. After that, / returned HTTP 200.
- Since browser MCP/node_repl was not exposed in this Codex session
and Playwright is not installed here, I drove the same /ws/agent
protocol the UI uses as a fallback.
Findings from that preflight:
- WebSocket open: ~0.087s
- initial connected message: ~0.002s after open
- /plan command response: ~0.004s
- /plan successfully switched to plan mode and returned: "Switched to
plan mode. I'm now your experimental design collaborator."
- The first synthesis chat turn stopped after ~0.357s with Anthropic 401
invalid x-api-key.
Why I could not complete the exact benchmark you asked for:
- There is no ANTHROPIC_API_KEY in the process/user/machine
environment here, so true plan synthesis cannot run.
- The missing gently_perception dependency means local offline startup
is not clean unless that package is installed, declared, or made optional
for plan-only/offline use.
- Browser MCP control was not available in this session, and
Playwright is absent, so I could not complete a visible browser
interaction/screenshot benchmark; I used a WebSocket fallback instead.
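The "made optional for plan-only/offline use" option mentioned above could look roughly like the guard below. This is a generic optional-import pattern using only the standard library, not the way Gently currently imports gently_perception; `load_optional`, `PERCEPTION_AVAILABLE`, and `require_perception` are hypothetical names.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import a top-level module if it is installed, otherwise return None."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


# Resolved once at startup; plan-only/offline mode can proceed when this is None.
perception = load_optional("gently_perception")
PERCEPTION_AVAILABLE = perception is not None


def require_perception() -> ModuleType:
    """Call from code paths that genuinely need the perception package."""
    if perception is None:
        raise RuntimeError(
            "gently_perception is not installed; this feature is unavailable "
            "in plan-only/offline mode"
        )
    return perception
```

The trade-off in this design is that a missing package surfaces as a clear error at the point of use instead of an import failure before the UI comes up. Declaring the dependency in pyproject.toml would be the alternative if it is meant to be mandatory.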
So the honest status is: I was able to set Gently up only partially. After
the route fix in #32,
the app can serve the web UI and /plan mode switching works, but I cannot
report real plan synthesis latency or plan quality until the API key plus
setup/tooling gaps are resolved.
This also reinforces the next design point from #32: before optimizing the
plan generation UX, we need the local offline plan-mode path to be
reliable, then benchmark the actual campaign -> phase -> task creation flow.
--
Kesavan
|